---
title: llama.cpp Engine
description: Configure Jan's local AI engine for optimal performance.
keywords:
  [
    Jan,
    local AI,
    llama.cpp,
    AI engine,
    local models,
    performance,
    GPU acceleration,
    CPU processing,
    model optimization,
  ]
---

import { Tabs } from 'nextra/components'
import { Callout, Steps } from 'nextra/components'
import { Settings } from 'lucide-react'

# Local AI Engine (llama.cpp)

`llama.cpp` is the core **inference engine** Jan uses to run AI models locally on your computer. This section covers the settings for the engine itself, which control *how* a model processes information on your hardware.

<Callout>
Looking for API server settings (like port, host, CORS)? They have been moved to the dedicated [**Local API Server**](/docs/desktop/api-server) page.
</Callout>

## Accessing Engine Settings

Find llama.cpp settings at **Settings** (<Settings width={16} height={16} style={{display:"inline"}}/>) > **Local Engine** > **llama.cpp**:

![llama.cpp](./_assets/llama.cpp-01-updated.png)

<Callout type="info">
Most users don't need to change these settings. Jan picks good defaults for your hardware automatically.
</Callout>

## When to Adjust Settings

You might need to modify these settings if:
- Models load slowly or don't work
- You've installed new hardware (like a graphics card)
- You want to optimize performance for your specific setup

## Engine Management

| Feature | What It Does | When You Need It |
|---------|-------------|------------------|
| **Engine Version** | Shows current llama.cpp version | Check compatibility with newer models |
| **Check Updates** | Downloads engine updates | When new models require updated engine |
| **Backend Selection** | Choose hardware-optimized version | After hardware changes or performance issues |

## Hardware Backends

Different backends are optimized for different hardware. Pick the one that matches your computer:

<Tabs items={['Windows', 'Linux', 'macOS']}>

<Tabs.Tab>

### NVIDIA Graphics Cards (Fastest)
**For CUDA 12.0:**
- `llama.cpp-avx2-cuda-12-0` (most common)
- `llama.cpp-avx512-cuda-12-0` (newer Intel/AMD CPUs)

**For CUDA 11.7:**
- `llama.cpp-avx2-cuda-11-7` (older drivers)

### CPU Only
- `llama.cpp-avx2` (modern CPUs)
- `llama.cpp-avx` (older CPUs)
- `llama.cpp-noavx` (very old CPUs)

### Other Graphics Cards
- `llama.cpp-vulkan` (AMD, Intel Arc)

</Tabs.Tab>

<Tabs.Tab>

### NVIDIA Graphics Cards
- `llama.cpp-avx2-cuda-12-0` (recommended)
- `llama.cpp-avx2-cuda-11-7` (older drivers)

### CPU Only
- `llama.cpp-avx2` (modern CPUs)
- `llama.cpp-arm64` (ARM processors)

### Other Graphics Cards
- `llama.cpp-vulkan` (AMD, Intel graphics)

</Tabs.Tab>

<Tabs.Tab>

### Apple Silicon (M1/M2/M3/M4)
- `llama.cpp-mac-arm64` (recommended)

### Intel Macs
- `llama.cpp-mac-amd64`

<Callout type="info">
Apple Silicon automatically uses GPU acceleration through Metal.
</Callout>

</Tabs.Tab>

</Tabs>

## Performance Settings

| Setting | What It Does | Recommended | Impact |
|---------|-------------|-------------|---------|
| **Continuous Batching** | Handle multiple requests simultaneously | Enabled | Faster when using tools or multiple chats |
| **Parallel Operations** | Number of concurrent requests | 4 | Higher = more multitasking, uses more memory |
| **CPU Threads** | Processor cores to use | Auto | More threads can speed up CPU processing |

## Memory Settings

| Setting | What It Does | Recommended | When to Change |
|---------|-------------|-------------|----------------|
| **Flash Attention** | Efficient memory usage | Enabled | Leave enabled unless problems occur |
| **Caching** | Remember recent conversations | Enabled | Speeds up follow-up questions |
| **KV Cache Type** | Memory vs quality trade-off | f16 | Change to q8_0 if low on memory |
| **mmap** | Efficient model loading | Enabled | Helps with large models |
| **Context Shift** | Handle very long conversations | Disabled | Enable for very long chats |

### Memory Options Explained
- **f16**: Best quality, uses more memory
- **q8_0**: Balanced memory and quality
- **q4_0**: Least memory, slight quality reduction

## Quick Troubleshooting

**Models won't load:**
- Try a different backend
- Check available RAM/VRAM
- Update engine version

**Slow performance:**
- Verify GPU acceleration is active
- Close memory-intensive applications
- Increase GPU Layers in model settings

**Out of memory:**
- Change KV Cache Type to q8_0
- Reduce Context Size in model settings
- Try a smaller model

**Crashes or errors:**
- Switch to a more stable backend (avx instead of avx2)
- Update graphics drivers
- Check system temperature

## Quick Setup Guide

**Most users:**
1. Use default settings
2. Only change if problems occur

**NVIDIA GPU users:**
1. Download CUDA backend
2. Ensure GPU Layers is set high
3. Enable Flash Attention

**Performance optimization:**
1. Enable Continuous Batching
2. Use appropriate backend for hardware
3. Monitor memory usage

<Callout type="info">
The default settings work well for most hardware. Only adjust these if you're experiencing specific issues or want to optimize for your particular setup.
</Callout>
